Model Based Sampling for Admissible Quantification of Model Uncertainty

University of Wisconsin-Madison

Merlise Clyde

Duke University

Outline

  • Canonical Regression Model & Bayesian Model Uncertainty

  • Estimation via Monte Carlo Frequencies from MCMC

  • Probability Proportional to Size Sampling in Finite Populations

  • Adaptive Independent Metropolis/Adaptive Importance Sampling

  • Illustration

Canonical Regression Model

  • Observe response vector \(\mathbf{Y}\) with predictor variables \(X_1 \dots X_p\).

  • Model for data under a specific model \({\cal M}_\gamma\): \[\begin{equation*} \mathbf{Y}\mid \alpha, \boldsymbol{\beta_\gamma}, \phi, {\cal M}_\gamma\sim \textsf{N}(\mathbf{1}_n\alpha + \mathbf{X}_{\boldsymbol{\gamma}}\boldsymbol{\beta_\gamma}, \mathbf{I}_n/\phi) \label{eq:linear.model} \end{equation*}\]

  • Models \({\cal M}_\gamma\) encoded by \(\boldsymbol{\gamma}= (\gamma_1, \ldots \gamma_p)^T\) binary vector with \(\gamma_j = 1\) indicating that \(X_j\) is included in model \({\cal M}_\gamma\) where \[\begin{align*} \gamma_j = 0 & \Leftrightarrow \beta_j = 0 \\ \gamma_j = 1 & \Leftrightarrow \beta_j \neq 0 \end{align*}\]

  • \(\mathbf{X}_{\boldsymbol{\gamma}}\): the \(n \times p_{\boldsymbol{\gamma}}\) design matrix for model \({\cal M}_\gamma\)

  • \(\boldsymbol{\beta_\gamma}\): the \(p_{\boldsymbol{\gamma}}\) vector of non-zero regression coefficients under \({\cal M}_\gamma\)

  • intercept \(\alpha\), precision \(\phi\) common to all models

Bayesian Prior Distributions

  • \(2^p\) models in the model space \(\boldsymbol{\Gamma}\)

  • prior distributions on all unknowns \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\equiv (\boldsymbol{\beta}_{{\cal M}_\gamma}, \alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma})\) and \({\cal M}_\gamma\)

  • reference priors for \(\alpha_{{\cal M}_\gamma}\), \(\phi_{{\cal M}_\gamma}\): \(p(\alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma} \mid {\cal M}_\gamma) \propto 1/\phi_{{\cal M}_\gamma}\)

  • “conventional” priors for \(\boldsymbol{\beta}_{{\cal M}_\gamma} \mid \phi, {\cal M}_\gamma\)

    • independent “spike and slab” priors
    • “spike and multivariate slab” priors as mixtures of Zellner’s \(g\)-priors
  • Mixtures of g-priors: \[\begin{align*} \boldsymbol{\beta}_{{\cal M}_\gamma} \mid g, \phi, {\cal M}_\gamma& \sim \textsf{N}\left(\mathbf{0}, g (\mathbf{X}_{\boldsymbol{\gamma}}^T \mathbf{X}_{\boldsymbol{\gamma}})^{-1}/\phi \right) \\ g \mid {\cal M}_\gamma& \sim \pi(g \mid {\cal M}_\gamma) \end{align*}\]

    • \(g\)-prior: point mass for \(g\) (\(g = n\), or Empirical Bayes estimate of \(g\) given \({\cal M}_\gamma\))
    • Zellner-Siow Cauchy prior: \(1/g \sim \textsf{Gamma}(1/2, n/2)\)
    • hyper-\(g\)-prior: \(\frac{g}{1 + g} \sim \textsf{Beta}(a, b)\), etc

Marginal Likelihoods of Models

  • for mixtures of \(g\)-priors, we can integrate out the parameters \(\boldsymbol{\theta}_{\boldsymbol{\gamma}}= (\boldsymbol{\beta_\gamma}, \alpha_{{\cal M}_\gamma}, \phi_{{\cal M}_\gamma})\) given \(g\) in closed form \[p(\mathbf{Y}\mid {\cal M}_\gamma, g) = \int p(\mathbf{Y}\mid \boldsymbol{\theta}_{\boldsymbol{\gamma}}, {\cal M}_\gamma)p(\boldsymbol{\theta}_{\boldsymbol{\gamma}}\mid g, {\cal M}_\gamma) d\boldsymbol{\theta}_{\boldsymbol{\gamma}}\] \[p(\mathbf{Y}\mid {{\cal M}}_{\boldsymbol{\gamma}}, g) = C (1+g)^{(n-1-p_{\boldsymbol{\gamma}})/2} \big[ 1 + g(1 - R^2_{\boldsymbol{\gamma}}) \big]^{-(n-1)/2}\]

  • marginal likelihood of \({\cal M}_\gamma\) is a one-dimensional integral (Laplace approximation or numerical) proportional to \[p(\mathbf{Y}\mid {\cal M}_\gamma) = \int p(\mathbf{Y}\mid g, {\cal M}_\gamma)\pi(g \mid {\cal M}_\gamma) dg\]

  • marginal likelihoods for models under mixtures of \(g\)-priors expressed as functions of \(F\)-statistics for testing \(\boldsymbol{\beta_\gamma}= 0\), \(n\) and \(p_{\boldsymbol{\gamma}}\)

  • provides a calibration of p-values via posterior probabilities of models
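The closed-form marginal likelihood above is straightforward to code; a minimal sketch (Python for illustration rather than the R code in BAS; the function name is mine, and the model-free constant \(C\) is dropped) evaluates it as a function of \(R^2_{\boldsymbol{\gamma}}\), \(n\), \(p_{\boldsymbol{\gamma}}\), and \(g\):

```python
import math

def log_marglik_g(R2, n, p_gamma, g):
    """Log p(Y | M, g) under Zellner's g-prior, up to the constant C:
    ((n-1-p)/2) log(1+g) - ((n-1)/2) log(1 + g(1 - R^2))."""
    return (0.5 * (n - 1 - p_gamma) * math.log1p(g)
            - 0.5 * (n - 1) * math.log1p(g * (1.0 - R2)))
```

Because the null model has \(p_{\boldsymbol{\gamma}} = 0\) and \(R^2 = 0\), its value under this normalization is 0, so the function returns log Bayes factors against the null model directly.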

Posterior Model Probabilities

\[p({\cal M}_\gamma\mid \mathbf{Y}) = \frac {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)} {\sum_{\boldsymbol{\gamma}\in \boldsymbol{\Gamma}} p(\mathbf{Y}\mid {\cal M}_\gamma)p({\cal M}_\gamma)}\]

  • posterior distribution of quantities \(\Delta\) of interest under BMA \[ \sum_{\boldsymbol{\gamma}\in \boldsymbol{\Gamma}} p({\cal M}_\gamma\mid \mathbf{Y}) p(\Delta \mid \mathbf{Y}, {\cal M}_\gamma) \]
    • estimation of \(\boldsymbol{\mu}= \mathbf{X}\boldsymbol{\beta}\): \(\textsf{E}[\boldsymbol{\mu}\mid \mathbf{Y}]\)
    • predictive distributions \(p(\mathbf{Y}^* \mid \mathbf{Y})\), \(\textsf{E}[\mathbf{Y}^* \mid \mathbf{Y}]\)
    • marginal inclusion probabilities \(P(\gamma_j = 1 \mid \mathbf{Y})\)
    • model selection

Can’t enumerate all models!
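For small \(p\), enumeration is feasible and makes the target concrete; a toy sketch (Python; synthetic data, uniform prior over models, \(g = n\); all names are mine, not the BAS implementation):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 4
X = rng.standard_normal((n, p))
y = 1.0 + 2.0 * X[:, 0] + rng.standard_normal(n)   # only X_1 matters

def r2(X_g, y):
    """R^2 of centered y regressed on the selected (centered) columns."""
    yc = y - y.mean()
    if X_g.shape[1] == 0:
        return 0.0
    Xc = X_g - X_g.mean(axis=0)
    coef, *_ = np.linalg.lstsq(Xc, yc, rcond=None)
    resid = yc - Xc @ coef
    return 1.0 - (resid @ resid) / (yc @ yc)

g = float(n)                                        # unit-information g-prior
logml = {}
for gamma in itertools.product([0, 1], repeat=p):   # all 2^p models
    cols = [j for j in range(p) if gamma[j]]
    R2 = r2(X[:, cols], y)
    logml[gamma] = (0.5 * (n - 1 - len(cols)) * np.log1p(g)
                    - 0.5 * (n - 1) * np.log1p(g * (1.0 - R2)))

models = sorted(logml)
w = np.array([logml[gm] for gm in models])
w = np.exp(w - w.max())
w /= w.sum()                                        # uniform prior cancels
post = dict(zip(models, w))                         # p(M | Y) by enumeration
incl = [sum(post[gm] for gm in models if gm[j]) for j in range(p)]
```

With the strong signal on \(X_1\), its marginal inclusion probability is near one while the noise variables stay low; beyond \(p \approx 25\) or so the \(2^p\) loop is exactly what becomes infeasible.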

MCMC Sampling from Posteriors

Use a sample of models from \(\boldsymbol{\Gamma}\) to approximate the posterior distribution of models

  • design a Markov Chain to transition through \(\boldsymbol{\Gamma}\) with stationary distribution \(p({\cal M}_\gamma\mid \mathbf{Y})\) \[ p({\cal M}_\gamma\mid \mathbf{Y}) \propto p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma) \]

  • propose a new model from \(q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})\)

  • accept moving to \(\boldsymbol{\gamma}^*\) with probability \[ \textsf{MH} = \min\left(1, \frac{p(\mathbf{Y}\mid {\cal M}_{\gamma^*}) p({\cal M}_{\gamma^*})/q(\boldsymbol{\gamma}^* \mid \boldsymbol{\gamma})} {p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)/q(\boldsymbol{\gamma}\mid \boldsymbol{\gamma}^*)}\right) \]

  • otherwise stay at model \({\cal M}_\gamma\)

  • models are sampled proportional to their posterior probabilities as \(T \to \infty\)
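A minimal sketch of such a sampler (Python; the symmetric single-flip proposal and the toy target are my choices for illustration):

```python
import numpy as np

def mh_model_sampler(log_post, p, T, rng):
    """Metropolis-Hastings over {0,1}^p with a symmetric single-flip
    proposal, so q cancels in the acceptance ratio.  log_post(gamma)
    returns log p(Y | M_gamma) + log p(M_gamma) up to a constant."""
    gamma = np.zeros(p, dtype=int)
    cur = log_post(gamma)
    out = np.empty((T, p), dtype=int)
    for t in range(T):
        prop = gamma.copy()
        prop[rng.integers(p)] ^= 1          # add or delete one variable
        new = log_post(prop)
        if np.log(rng.random()) < new - cur:
            gamma, cur = prop, new          # accept; otherwise stay put
        out[t] = gamma
    return out

# toy target strongly favoring the model with only the first variable
rng = np.random.default_rng(5)
log_post = lambda g: 5.0 * g[0] - 5.0 * g[1:].sum()
samples = mh_model_sampler(log_post, p=5, T=4000, rng=rng)
incl = samples[1000:].mean(axis=0)          # Monte Carlo inclusion frequencies
```

The post-burn-in frequencies approximate the target inclusion probabilities, which is exactly the Monte Carlo frequency estimator of the next slide.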

Estimation in BMA

Estimate the probabilities of models via Monte Carlo frequencies of models or ergodic averages \[\begin{align*} \widehat{p({\cal M}_\gamma\mid \mathbf{Y})} & = \frac{\sum_{t = 1}^T I({{\cal M}}_t = {\cal M}_\gamma)} {T} \\ & = \frac{n_{\boldsymbol{\gamma}}\, I({\cal M}_\gamma\in S)} {\sum_{\boldsymbol{\gamma}' \in S} n_{\boldsymbol{\gamma}'}} \end{align*}\]

  • \(T\) = # MCMC samples

  • \(S\) is the collection of unique sampled models

  • \(n_{\boldsymbol{\gamma}}\) is the frequency of model \({\cal M}_\gamma\) among the \(T\) draws

  • \(n = \sum_{\boldsymbol{\gamma}\in S} n_{\boldsymbol{\gamma}} = T\), the total number of sampled models counting repeats (the number of unique models is \(|S|\))

  • asymptotically unbiased as \(T \to \infty\)

Current Status

  • BMA/BVS has a reputation of being slow and computationally intensive!

  • Porwal and Raftery (PNAS 2022) compare a range of state-of-the-art methods and standards in high-dimensional problems

  • 21 methods based on available R packages

    • BMA using a range of g-priors or mixtures of g-priors
    • BIC
    • Lasso
    • SCAD
    • Elastic Net
    • Horseshoe
    • EMVS
    • Spike & Slab Lasso
  • BMA methods are competitive with Lasso in terms of computational speed and better than Lasso and others in terms of prediction and model selection!

Issues with Monte Carlo Frequencies

  • fundamentally unsound to a Bayesian! (O’Hagan 1987, The Statistician)

  • in high-dimensional problems many models are rarely or never sampled (high variance but unbiased)

  • ignores observed information in the marginal likelihoods \(\times\) prior probabilities!

  • alternatives based on estimating the normalizing constant \[C = \sum_{\boldsymbol{\gamma}\in \boldsymbol{\Gamma}} p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)\]

  • renormalized observed marginal likelihoods \(\times\) prior probabilities of sampled models in \({\cal{S}}\)

    • exact posterior odds and lower variance,
    • but can have higher bias if a large fraction of models are not sampled!
  • can we do better using ideas from Finite Population Sampling?

MCMC and Finite Population Sampling

  • Can view Metropolis-Hastings sampling from \(\boldsymbol{\Gamma}\) (in the limit) as a form of Probability Proportional to Size (PPS) sampling With Replacement from \(\boldsymbol{\Gamma}\)

  • Let \(q({{\cal M}}_i)\) be the probability of sampling \({{\cal M}}_i\)

  • Goal is to estimate \[C = \sum_{i=1}^N p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\] where \(N = |\boldsymbol{\Gamma}|\)

  • Finite Population Sampling Estimators

    • Hansen-Hurwitz (HH) (sampling with replacement)
    • Horvitz-Thompson (HT) (any sampling design*)

Hansen-Hurwitz (HH)

  • Hansen-Hurwitz (1943) may be viewed as an importance sampling estimate \[\hat{C} = \frac{1}{n}\sum_{i \in S} \frac{ n_i\, p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{q({{\cal M}}_i)} \] where \(n = \sum_{i \in S} n_i\) is the total number of draws

  • If we have “perfect” samples from the posterior then \(q({{\cal M}}_i) = \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{C}\) and recover \(C\)!

  • Since \(C\) is unknown, \(q({{\cal M}}_i) \propto p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) is known only up to normalization; the ratio HH (self-normalized IS) estimator instead uses the identity \(\textsf{E}_q\left[1/p(\mathbf{Y}\mid {{\cal M}}_i)\right] = 1/C\), giving \[ \hat{C} = \left[ \frac{1}{n} \sum_{i \in S} \frac{n_i}{p(\mathbf{Y}\mid {{\cal M}}_i)} \right]^{-1} \]

This recovers the harmonic mean estimator of Newton & Raftery (1994); although simulation-consistent, it is notoriously unstable (the summands \(1/p(\mathbf{Y}\mid {{\cal M}}_i)\) can be heavy-tailed)!
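A small numeric sketch of the harmonic mean construction (Python; the enumerable toy "model space" is made up). It first verifies the identity behind the estimator exactly, then forms the sampled version, whose summands have a heavy right tail:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 200                                    # small enumerable "model space"
like = np.exp(rng.normal(0.0, 3.0, N))     # toy p(Y | M_i)
prior = np.full(N, 1.0 / N)                # uniform prior over models
m = like * prior                           # p(Y | M_i) p(M_i)
C = m.sum()                                # exact normalizing constant
post = m / C

# identity behind the harmonic mean estimator: E_post[1/p(Y|M)] = 1/C
assert np.isclose((post / like).sum(), 1.0 / C)

# sampled version from "perfect" posterior draws; the summands 1/p(Y|M)
# are heavy-tailed, which is the source of the notorious instability
T = 5000
draws = rng.choice(N, size=T, p=post)
C_hat = 1.0 / np.mean(1.0 / like[draws])
```

Rerunning with different seeds shows the estimate jumping whenever a model with tiny likelihood happens to be drawn.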

Horvitz-Thompson (HT)

  • FPS estimates such as HH that depend on MC frequencies

    • violate the Likelihood principle (Basu, 1988)
    • are inadmissible!
  • HT estimate of normalizing constant: \[\quad \hat{C} = \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)} {\pi_i}\]

  • \(\pi_i\) is the inclusion probability that \(\boldsymbol{\gamma}_i \in S\)

  • under sampling with replacement \(\pi_i = 1 - (1 - q({{\cal M}}_i))^T\)

  • Horvitz-Thompson dominates Hansen-Hurwitz (smaller variance) and is the unique hyper-admissible estimate of \(C\) (Joshi, 1972; Ramakrishnan, 1973)
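A sketch of the HT estimate under PPS sampling with replacement (Python; toy values stand in for \(p(\mathbf{Y}\mid{\cal M}_i)p({\cal M}_i)\), and we sample from the exact posterior, which in practice is unavailable):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T = 500, 2000
m = np.exp(rng.normal(0.0, 2.0, N))        # toy p(Y | M_i) p(M_i)
C = m.sum()                                # target normalizing constant
q = m / C                                  # per-draw sampling probabilities

draws = rng.choice(N, size=T, p=q)         # PPS sampling with replacement
S = np.unique(draws)                       # distinct sampled models
pi = 1.0 - (1.0 - q[S]) ** T               # inclusion probabilities
C_ht = np.sum(m[S] / pi)                   # Horvitz-Thompson estimate
```

Note the sum runs over distinct sampled models only, weighted by inclusion (not per-draw) probabilities.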

Basu and Bayes

Basu’s (1971) famous circus example illustrated potential problems with the Horvitz-Thompson estimator (a similar problem arises with IS)

  • violates the likelihood principle

  • once we have samples, \(p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) are fixed and the sampling probabilities are not relevant

  • the only randomness is in the remaining units that were not sampled (which is where the sampling design enters)

  • Basu’s estimate under SWOR (using \(\pi_i \propto A_i = q({{\cal M}}_i)\)), \[\hat{C} = \sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \left( \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i)p({{\cal M}}_i)}{\pi_i} \right) \times \left(\sum_{i \notin S} \pi_i \right)\]

  • conditions on the observed sum and estimates the remaining (unsampled) mass
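Basu's estimate can be sketched on the same kind of toy population (Python; in this toy \(q\), and hence \(\pi_i\) for the unsampled models, is known exactly, which it would not be in the model-space application):

```python
import numpy as np

rng = np.random.default_rng(3)
N, T = 500, 2000
m = np.exp(rng.normal(0.0, 2.0, N))        # toy p(Y | M_i) p(M_i)
C = m.sum()
q = m / C
draws = rng.choice(N, size=T, p=q)
S = np.unique(draws)
notS = np.setdiff1d(np.arange(N), S)
pi_S = 1.0 - (1.0 - q[S]) ** T             # inclusion probs, sampled models
pi_out = 1.0 - (1.0 - q[notS]) ** T        # ... unsampled (known in the toy)

n = S.size
eta_hat = np.mean(m[S] / pi_S)             # HT-type estimate of the mean
# observed total plus predicted mass of the unsampled models
C_basu = m[S].sum() + eta_hat * pi_out.sum()
```

The observed term is fixed once the sample is in hand; only the second term, the prediction for the unsampled units, carries uncertainty.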

Model Based Methods

Basu’s (1971) estimate of the total can be justified as a “super-population” Model Based approach (Meeden and Ghosh, 1983)

  • Let \(m_i = p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)\) \[\begin{align} m_i \mid \pi_i &\mathrel{\mathop{\sim}\limits^{\rm ind}}N(\pi_i \eta, \tau^2 \pi_i^2) \\ p(\eta, \tau^2) & \propto 1/\tau^2 \end{align}\]

  • posterior mean of \(\eta\) is \(\hat{\eta} = \frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i}\) (the HT estimator of the total, scaled by \(1/n\))

  • using the posterior predictive for \(m_i\), \(i \notin S\): \(\textsf{E}[m_i \mid \{m_j\}_{j \in S}] = \pi_i \hat{\eta}\) \[\begin{align*} C & = \sum_{i \in \boldsymbol{\Gamma}} m_i = \sum_{i \in S} m_i + \sum_{i \notin S} m_i \\ \hat{C} & = \sum_{i \in S} m_i + \sum_{i \notin S} \hat{\eta} \pi_i = \sum_{i \in S} m_i + \left[\frac{1}{n} \sum_{i \in S} \frac{m_i}{\pi_i} \right] \sum_{i \notin S} \pi_i \end{align*}\]

Final Posterior Estimates

  • estimate of the posterior probability of \({\cal M}_\gamma\) for \({\cal M}_\gamma\in S\) \[ \frac{p(\mathbf{Y}\mid {\cal M}_\gamma) p({\cal M}_\gamma)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]

  • estimate of the total posterior probability of all models in \(\boldsymbol{\Gamma}\setminus S\) from the predictive distribution \[ \frac{\frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)}{\sum_{i \in S} p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i) + \frac{1}{n} \sum_{i \in S} \frac{p(\mathbf{Y}\mid {{\cal M}}_i) p({{\cal M}}_i)}{\pi_i}\sum_{i \in S} (1 - \pi_i)} \]

  • Uses renormalized marginal likelihoods of sampled models

  • easy to compute marginal inclusion probabilities

  • Can also derive \(\textsf{E}[\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{X}\boldsymbol{\beta}\mid \mathbf{Y}]\), \(\textsf{E}[\mathbf{Y}^* \mid \mathbf{Y}]\) or \(p(\Delta \mid \mathbf{Y})\)

  • still have to come up with a good \(q({{\cal M}}_i)\)!
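Putting the pieces together, a toy sketch of the final estimates (Python rather than the R code in BAS; the synthetic model space and all names are illustrative, and \(\sum_{i\in S}(1-\pi_i)\) is used for the unsampled inclusion mass as on this slide):

```python
import numpy as np

rng = np.random.default_rng(6)
N, T = 400, 1500
m = np.exp(rng.normal(0.0, 2.0, N))        # toy p(Y | M_i) p(M_i)
q = m / m.sum()                            # stand-in for a good proposal
S = np.unique(rng.choice(N, size=T, p=q))  # distinct sampled models
pi = 1.0 - (1.0 - q[S]) ** T               # inclusion probabilities
n = S.size

eta_hat = np.mean(m[S] / pi)
unseen = eta_hat * (1.0 - pi).sum()        # predicted mass of Gamma \ S
C_hat = m[S].sum() + unseen

post_S = m[S] / C_hat                      # estimates for sampled models
p_unseen = unseen / C_hat                  # total mass left for the rest
```

By construction the sampled-model estimates and the unseen mass sum to one, so marginal inclusion probabilities and BMA expectations follow by summing `post_S` over the relevant models.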

Adaptive Independent MH to the Rescue?

Griffin et al (2021):

  • independent Metropolis-Hastings proposal \[q(\boldsymbol{\gamma}) = \prod_{j=1}^p \pi_j^{\gamma_j} (1 - \pi_j)^{1-\gamma_j}\]

  • Product of Independent Bernoullis is optimal for target where \(\gamma_j\) are independent a posteriori

  • Use adaptive independent Metropolis-Hastings to learn \(\pi_j\) from past samples

  • uses Metropolis-Hastings Ratio to accept proposals + Monte Carlo frequencies to estimate posterior model probabilities

  • poor approximation for standalone sampling with replacement/importance sampling with increasing correlation in \(\mathbf{X}\)

Choice for \(q({\cal M}_\gamma)\) ?

  • The joint posterior distribution of \(\boldsymbol{\gamma}\) (dropping \(\mathbf{Y}\)) may be factored: \[p({\cal M}_\gamma\mid \mathbf{Y}) \equiv p(\boldsymbol{\gamma}\mid \mathbf{Y}) = \prod_{j = 1}^p p(\gamma_j \mid \boldsymbol{\gamma}_{<j}) \] where \(\boldsymbol{\gamma}_{< j}\equiv \{\gamma_k\}\) for \(k < j\) and \(p(\gamma_1 \mid \boldsymbol{\gamma}_{<1}) \equiv p(\gamma_1)\).

  • As \(\gamma_j\) are binary, re-express as \[\begin{equation*} p(\boldsymbol{\gamma}\mid \mathbf{Y}) = \prod_{j=1}^p(\rho_{j \mid <j})^{\gamma_j}{(1-\rho_{j \mid <j})}^{1-\gamma_j} \end{equation*}\] where \(\rho_{j \mid <j} \equiv \Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j})\) and \(\rho_{1 \mid < 1} = \rho_1\), the marginal probability.

  • Product of Dependent Bernoullis

Global Adaptive MCMC Proposal

Factor proposal \[q(\boldsymbol{\gamma}) = \prod_{j = 1}^p q(\gamma_j \mid \boldsymbol{\gamma}_{<j}) = \prod_j \textsf{Ber}(\hat{\rho}_{j \mid <j}) \]

  • Note: \(\Pr(\gamma_j = 1 \mid \boldsymbol{\gamma}_{<j}) = \textsf{E}[\gamma_j \mid \boldsymbol{\gamma}_{<j}]\)

  • Fit a sequence of \(p\) regressions \(\gamma_j\) on \(\boldsymbol{\gamma}_{<j}\) \[\begin{align*} \gamma_1 & = \mu_1 + \epsilon_1 \\ \gamma_2 & = \mu_2 + \beta_{2 1} (\gamma_1 - \mu_1) + \epsilon_2 \\ \gamma_3 & = \mu_3 + \beta_{3 1} (\gamma_1 - \mu_1) + \beta_{3 2} (\gamma_2 - \mu_2) + \epsilon_3 \\ & \vdots \\ \gamma_p & = \mu_p + \beta_{p 1} (\gamma_1 - \mu_1) + \ldots + \beta_{p \, p-1} (\gamma_{p-1} - \mu_{p-1})+ \epsilon_p \end{align*}\]

Compositional Regression

Approximate model \[\boldsymbol{\gamma}\sim \textsf{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma}_{\boldsymbol{\gamma}})\]

  • Wermuth (1980) compositional regression \[ \mathbf{G}= \mathbf{1}_{T} \boldsymbol{\mu}^T + (\mathbf{G}- \mathbf{1}_T \boldsymbol{\mu}^T) \mathbf{B}+ \boldsymbol{\epsilon} \]

  • \(\mathbf{G}\) is \(T \times p\) matrix where row \(t\) is \(\boldsymbol{\gamma}_t\)

  • \(\boldsymbol{\mu}\) is the \(p\) dimensional vector of \(\textsf{E}[\boldsymbol{\gamma}]\)

  • \(\boldsymbol{\Sigma}_{\boldsymbol{\gamma}} = \mathbf{U}^T \mathbf{U}\) where \(\mathbf{U}\) is upper triangular Cholesky decomposition of covariance matrix of \(\boldsymbol{\gamma}\) (\(p \times p\))

  • \(\mathbf{B}^T = \mathbf{I}_p - \text{diag}(\mathbf{U})\, \mathbf{U}^{-T}\) (strictly lower triangular)

  • \(\mathbf{B}\) is a \(p \times p\) upper triangular matrix with zeros on the diagonal; the coefficients of the \(j\)th regression appear in column \(j\) (row \(j\) of \(\mathbf{B}^T\))
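The claim that one Cholesky factorization yields all \(p\) regressions can be checked numerically; a sketch (Python; toy covariance, using the convention \(\boldsymbol{\Sigma} = \mathbf{U}^T\mathbf{U}\) so that \(\mathbf{B}^T = \mathbf{I} - \text{diag}(\mathbf{U})\,\mathbf{U}^{-T}\) is strictly lower triangular):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 5
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)            # toy covariance of gamma
U = np.linalg.cholesky(Sigma).T            # upper triangular, Sigma = U^T U

# rows of B^T hold the coefficients of the sequential regressions
# gamma_j ~ gamma_{<j}; the diagonal is identically zero
Bt = np.eye(p) - np.diag(np.diag(U)) @ np.linalg.inv(U.T)

# population regression coefficients for the last equation, computed directly
j = p - 1
b_j = np.linalg.solve(Sigma[:j, :j], Sigma[:j, j])
```

Row \(j\) of `Bt` matches the usual \(\boldsymbol{\Sigma}_{<j,<j}^{-1}\boldsymbol{\Sigma}_{<j,j}\) coefficients, so a single factorization prices all \(p\) conditional regressions at once.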

Estimators of \(\mathbf{B}\) and \(\boldsymbol{\mu}\)

  • OLS is BLUE and consistent, but \(\mathbf{G}\) may not be full rank

  • apply Bayesian Shrinkage with “priors” on \(\boldsymbol{\mu}\) (non-informative or Normal) and \(\Sigma\) (inverse-Wishart)

  • pseudo-posterior mean of \(\boldsymbol{\mu}\) is the current estimate of the marginal inclusion probabilities: \(\hat{\boldsymbol{\mu}} = \bar{\boldsymbol{\gamma}}\)

  • use pseudo-posterior mean for \(\boldsymbol{\Sigma}\)

  • one Cholesky decomposition provides all coefficients for the \(p\) predictions for proposing \(\boldsymbol{\gamma}^*\)

  • constrain predicted values \(\hat{\rho}_{j \mid <j} \in (\delta, 1-\delta)\)

  • generate \(\gamma^*_j \mid \boldsymbol{\gamma}^*_{< j} \sim \textsf{Ber}(\hat{\rho}_{j \mid <j})\)

  • use as proposal for Adaptive Independent Metropolis-Hastings or Importance Sampling (Accept all) -or- Sampling Without Replacement (future)
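The proposal steps above can be sketched end-to-end (Python; the function name, toy inputs, and truncation value are mine):

```python
import numpy as np

def propose(mu, Sigma, delta, rng):
    """Draw gamma* by sequential Bernoullis whose success probabilities
    come from the Cholesky-based compositional regressions, truncated
    to (delta, 1 - delta); also return log q(gamma*)."""
    p = len(mu)
    U = np.linalg.cholesky(Sigma).T            # Sigma = U^T U
    Bt = np.eye(p) - np.diag(np.diag(U)) @ np.linalg.inv(U.T)
    gamma = np.zeros(p)
    logq = 0.0
    for j in range(p):
        rho = mu[j] + Bt[j, :j] @ (gamma[:j] - mu[:j])
        rho = min(max(rho, delta), 1.0 - delta)  # keep proposal proper
        gamma[j] = float(rng.random() < rho)
        logq += np.log(rho) if gamma[j] else np.log1p(-rho)
    return gamma.astype(int), logq

# toy usage: independent Bernoulli(1/2) covariance gives rho_j = 1/2
mu = np.full(4, 0.5)
gamma_star, logq = propose(mu, 0.25 * np.eye(4), 0.025, np.random.default_rng(7))
```

The returned \(\log q(\boldsymbol{\gamma}^*)\) is exactly what the independent MH ratio or the importance weights require.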

Simulation

  • tecator data (Griffin et al (2021))

  • a sample of \(p = 20\) variables

  • compare enumeration to

    • MCMC with add, delete, and swap moves with \(q\)
    • Adaptive Independent MCMC
    • Importance Sampling with HT
    • MCMC+BAS (SWOR independent Bernoulli)
    • MCMC+BAS + Basu’s estimate of \(C\)

  • same settings for all methods so that the number of model evaluations is comparable

MSE Comparison

Continued Adaptation ?

  • can update Cholesky with rank 1 updates with new models
  • how to combine IS with MH samples (weighting) ?
  • HT/Hájek: computationally complex if inclusion probabilities must be recomputed for all sampled models as the design updates (previous and future models)
  • Basu (1971) approach works with PPS-SWOR (sequential updates \(\pi_i\))

Refinements

  • Want to avoid MCMC for

    • pseudo Bayesian posteriors used to learn proposal distribution in sample design for models
    • estimation of posterior model probabilities in model-based approaches (i.e., learning \(\beta\), sampling from predictive distribution)
    • estimation of general quantities under BMA?
  • avoid infinite regret

  • more general models?

  • other mean/variance assumptions for the super-population model lead to other estimates for \(C\), \(p({\cal M}_\gamma\mid \mathbf{Y})\), etc

Summary

  • Adaptive Independent Metropolis proposal for models (use in more complex IS)
  • Use observed values of unique marginal likelihoods of models for estimating posterior distribution
  • Bayes estimates from MC output (solution to O’Hagan 1987?)

Thank You!

Code in the development version of BAS at https://github.com/merliseclyde/BAS